Objectives

  • Review the mechanics and assumptions of linear regression
  • Identify regression diagnostic tests based on visualization
  • Compare tables vs. graphs for publishing statistical results in academic journals
  • Discuss presenting results in a paper vs. a presentation
library(tidyverse)
library(knitr)
library(broom)
library(stringr)
library(modelr)
library(forcats)

options(digits = 3)
set.seed(1234)
theme_set(theme_minimal())

Linear models and visualization

Linear functional form

Linear models are the simplest functional form to understand. They take the generic form

\[Y = \beta_0 + \beta_{1}X\]

where \(Y\) is the outcome of interest, \(X\) is the explanatory or predictor variable, and \(\beta_0\) and \(\beta_1\) are parameters that vary to capture different patterns. In algebraic terms, \(\beta_0\) is the intercept and \(\beta_1\) the slope of the linear equation. Given the empirical values you have for \(X\) and \(Y\), you generate a fitted model that finds the parameter values that best fit the data.

ggplot(sim1, aes(x, y)) + 
  geom_point()

This looks like a linear relationship. We could randomly generate parameters for the formula \(y = \beta_0 + \beta_1 x\) to try to explain or predict the relationship between \(x\) and \(y\):

models <- tibble(
  a1 = runif(250, -20, 40),
  a2 = runif(250, -5, 5)
)

ggplot(sim1, aes(x, y)) + 
  geom_abline(aes(intercept = a1, slope = a2), data = models, alpha = 1/4) +
  geom_point()

But obviously some parameters are better than others. We need a criterion for distinguishing good parameter values from bad ones.

Least squares regression

One widely used approach is least squares: find the parameter values that minimize the sum of the squared errors across every observation. The errors are simply the differences between the actual values of \(y\) and the predicted values of \(y\) (also known as the residuals).
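The logic can be sketched in a few lines of base R (the data and the measure_distance() helper here are illustrative, not part of the chapter's sim1 example): compute the root-mean-squared error for candidate parameter values, and prefer the parameters that make it smallest.

```r
set.seed(1234)
# simulated data from a known linear process: y = 4 + 2x + noise
sim <- data.frame(x = 1:30, y = 4 + 2 * (1:30) + rnorm(30, sd = 2))

# root-mean-squared error of the candidate line y = intercept + slope * x
measure_distance <- function(intercept, slope, data) {
  sqrt(mean((data$y - (intercept + slope * data$x))^2))
}

measure_distance(4, 2, sim)  # near the true parameters: small error
measure_distance(0, 0, sim)  # arbitrary parameters: much larger error
```

lm() performs this minimization analytically rather than by searching over candidate values, but the criterion it optimizes is the same.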

dist1 <- sim1 %>% 
  mutate(
    dodge = rep(c(-1, 0, 1) / 20, 10),
    x1 = x + dodge,
    pred = 7 + x1 * 1.5
  )

ggplot(dist1, aes(x1, y)) + 
  geom_abline(intercept = 7, slope = 1.5, color = "grey40") +
  geom_point(color = "grey40") +
  geom_linerange(aes(ymin = y, ymax = pred), color = "#3366FF")

To estimate a linear regression model in R, we use the lm() function. The lm() function takes two arguments. The first is a formula specifying the equation to be estimated (lm() translates y ~ x into \(y = \beta_0 + \beta_1 x\)). The second is the data frame containing the variables:

sim1_mod <- lm(y ~ x, data = sim1)

We can use the summary() function to examine key model components, including parameter estimates, standard errors, and model goodness-of-fit statistics.

summary(sim1_mod)
## 
## Call:
## lm(formula = y ~ x, data = sim1)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -4.147 -1.520  0.133  1.467  4.652 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    4.221      0.869    4.86  4.1e-05 ***
## x              2.052      0.140   14.65  1.2e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.2 on 28 degrees of freedom
## Multiple R-squared:  0.885,  Adjusted R-squared:  0.88 
## F-statistic:  215 on 1 and 28 DF,  p-value: 1.17e-14

The resulting line from this regression model looks like:

dist2 <- sim1 %>%
  add_predictions(sim1_mod) %>%
  mutate(
    dodge = rep(c(-1, 0, 1) / 20, 10),
    x1 = x + dodge
  )

ggplot(dist2, aes(x1, y)) + 
  geom_smooth(method = "lm", color = "grey40") +
  geom_point(color = "grey40") +
  geom_linerange(aes(ymin = y, ymax = pred), color = "#3366FF")

Residuals and visualizations

The blue lines identify the residuals, the error for an individual observation \(e_i = y_i - \hat{y}_i\), or the difference between the actual and predicted value for the outcome of interest. Residuals can be very useful for diagnostic tests with linear regression. The distribution of residuals can tell us many things about our regression model. For instance, consider the linear relationship between displ and hwy in the mpg dataset:

# estimate linear model
mpg_lin <- lm(hwy ~ displ, data = mpg)

# plot the model using geom_smooth()
ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

Not a great model (the relationship isn’t strictly linear - we’ll get to that more in a minute). But sometimes your model predicts some types of observations better than others. For instance, we also know which cars are front-wheel vs. rear-wheel vs. four-wheel drive:

ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(color = drv)) +
  geom_smooth(method = "lm", se = FALSE)

Could our model be more accurate at predicting highway mileage for one of these three categories? We can plot the distributions of the residuals from our original model to find out:

mpg %>%
  add_residuals(mpg_lin) %>%
  ggplot(aes(resid, color = drv)) +
  geom_density()

If the model performed equally well for all three categories, the distributions should look similar to one another (ideally normally distributed centered around 0). It appears here that there are substantial differences in the distributions, suggesting perhaps we should include drv as an additional variable in our regression model in order to improve predictive accuracy:

mpg_drv <- lm(hwy ~ displ + drv, data = mpg)

mpg %>%
  add_residuals(mpg_drv) %>%
  ggplot(aes(resid, color = drv)) +
  geom_density()

By doing so, the residuals are now more similarly distributed across the three categories.

Testing assumptions of linear regression using visualizations

Non-linearity of the data

Linear regression assumes the relationship between the predictors and the response is a straight line. If the true relationship is non-linear, then we cannot generate accurate inferences from the model.

ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

Doesn’t look linear. We could just look at this graph and draw that conclusion (a very basic visualization), but what happens once we estimate a model with more than a single predictor? Instead we can use residual plots to identify non-linearity. Recall that the residual is the error for an individual observation \(e_i = y_i - \hat{y}_i\), or the difference between the actual and predicted value for the outcome of interest. Residual plots graph the relationship between the fitted values and the residuals. Ideally there should be no discernible pattern in the graph; if there is a pattern, it indicates a problem with some aspect of the linear model.

# estimate models
mpg_lin <- lm(hwy ~ displ, data = mpg)
mpg_quad <- lm(hwy ~ poly(displ, 2, raw = TRUE), data = mpg)

mpg %>%
  add_predictions(mpg_lin) %>%
  add_residuals(mpg_lin) %>%
  ggplot(aes(pred, resid)) +
  geom_hline(yintercept = 0, linetype = 2) +
  geom_point(alpha = .2) +
  geom_smooth(se = FALSE) +
  labs(title = "Residual plot for linear fit")

mpg %>%
  add_predictions(mpg_quad) %>%
  add_residuals(mpg_quad) %>%
  ggplot(aes(pred, resid)) +
  geom_hline(yintercept = 0, linetype = 2) +
  geom_point(alpha = .2) +
  geom_smooth(se = FALSE) +
  labs(title = "Residual plot for quadratic fit")

Non-constant variance of the error terms

Another assumption of linear regression is that the error terms \(\epsilon_i\) have a constant variance, \(\text{Var}(\epsilon_i) = \sigma^2\). This is called homoscedasticity. Estimates of standard errors rely on this assumption. If the variances of the error terms are non-constant (aka heteroscedastic), our estimates of the standard errors will be inaccurate - they will either be inflated or deflated, leading to incorrect inferences about the statistical significance of predictor variables.

We can uncover homo- or heteroscedasticity through the use of the residual plot. Below is data generated from the process:

\[Y = 2 + 3X + \epsilon\]

where \(\epsilon\) is random error distributed normally \(N(0,1)\).

sim_homo <- tibble(x = runif(1000, 0, 10),
                   y = 2 + 3 * x + rnorm(1000, 0, 1))
sim_homo_mod <- lm(y ~ x, data = sim_homo)

sim_homo %>%
  add_predictions(sim_homo_mod) %>%
  add_residuals(sim_homo_mod) %>%
  ggplot(aes(pred, resid)) +
  geom_hline(yintercept = 0, linetype = 2) +
  geom_point(alpha = .2) +
  geom_quantile(method = "rqss", lambda = 5, quantiles = c(.05, .95)) +
  labs(title = "Homoscedastic variance of error terms",
       x = "Predicted values",
       y = "Residuals")

Compare this to a linear model fit to the data generating process:

\[Y = 2 + 3X + \epsilon\]

where \(\epsilon\) is random error normally distributed with mean \(0\) and standard deviation \(\frac{X}{2}\). Note that the variance of the error term for each observation \(\epsilon_i\) is not constant, but is itself a function of \(X\).

sim_hetero <- tibble(x = runif(1000, 0, 10),
                     y = 2 + 3 * x + rnorm(1000, 0, (x / 2)))
sim_hetero_mod <- lm(y ~ x, data = sim_hetero)

sim_hetero %>%
  add_predictions(sim_hetero_mod) %>%
  add_residuals(sim_hetero_mod) %>%
  ggplot(aes(pred, resid)) +
  geom_hline(yintercept = 0, linetype = 2) +
  geom_point(alpha = .2) +
  geom_quantile(method = "rqss", lambda = 5, quantiles = c(.05, .95)) +
  labs(title = "Heteroscedastic variance of error terms",
       x = "Predicted values",
       y = "Residuals")

We see a distinct funnel shape in the relationship between the predicted values and the residuals: the spread of the residuals grows as \(X_i\) increases because the error variance itself is a function of \(X_i\). A model that assumes constant variance will therefore substantially over- or understate the uncertainty around the response \(Y_i\) as \(X_i\) increases.
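The visual diagnosis can be backed up numerically. As a minimal sketch in base R (the bp_stat() helper is hypothetical, implementing the studentized Breusch-Pagan statistic), non-constant variance shows up as a large test statistic for the heteroscedastic process but not for the homoscedastic one:

```r
set.seed(1234)
x <- runif(1000, 0, 10)
y_homo   <- 2 + 3 * x + rnorm(1000, 0, 1)      # constant error variance
y_hetero <- 2 + 3 * x + rnorm(1000, 0, x / 2)  # variance grows with x

# studentized Breusch-Pagan statistic: n * R^2 from regressing the squared
# residuals on the predictor; approximately chi-squared (1 df) under the null
bp_stat <- function(fit, x) {
  aux <- lm(residuals(fit)^2 ~ x)
  length(x) * summary(aux)$r.squared
}

bp_stat(lm(y_homo ~ x), x)    # small: consistent with constant variance
bp_stat(lm(y_hetero ~ x), x)  # large: strong evidence of heteroscedasticity
```

The lmtest package's bptest() implements the same test with a formal p-value.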

Outliers

An outlier is an observation for which the predicted value \(\hat{y}_i\) is far from the actual value \(y_i\). Sometimes outliers are simply coding errors in the original dataset, but sometimes they are extreme or unusual values that were not generated by the same data generating process as the remaining dataset. Detecting outliers is the first step to deciding how to handle them (i.e. keep or remove them from the analysis).

We can use a few different visualizations to detect outliers. In a bivariate linear regression model, simply plot the variables against one another and draw the best-fit line:

sim <- tibble(x = rnorm(20),
              y = -2 + x + rnorm(20)) %>%
  mutate(y = ifelse(row_number(x) == 15, y + 10, y),
         outlier = row_number(x) == 15)

ggplot(sim, aes(x, y)) +
  geom_point(aes(color = outlier)) +
  geom_smooth(method = "lm", se = FALSE) +
  geom_smooth(data = filter(sim, outlier == FALSE),
              method = "lm", se = FALSE, linetype = 2) +
  scale_color_manual(values = c("black", "red")) +
  theme(legend.position = "none")

Or we can draw another residual plot (either raw or standardized):

sim_lm <- lm(y ~ x, data = sim)

sim %>%
  add_predictions(sim_lm) %>%
  add_residuals(sim_lm) %>%
  ggplot(aes(pred, resid)) +
  geom_hline(yintercept = 0, linetype = 2) +
  geom_point(aes(color = outlier)) +
  labs(x = "Fitted values",
       y = "Residuals") +
  scale_color_manual(values = c("black", "red")) +
  theme(legend.position = "none")

augment(sim_lm) %>%
  left_join(sim) %>%
  ggplot(aes(.fitted, .std.resid)) +
  geom_hline(yintercept = 0, linetype = 2) +
  geom_point(aes(color = outlier)) +
  labs(x = "Fitted values",
       y = "Standardized residuals") +
  scale_color_manual(values = c("black", "red")) +
  theme(legend.position = "none")

High-leverage points

High-leverage points are observations that have strong effects on the coefficient estimates in a linear model. An observation’s influence is a function of its leverage and its discrepancy:

\[\text{Influence} = \text{Leverage} \times \text{Discrepancy}\]

  • Leverage - degree of potential influence on the coefficient estimates that a given observation can (but not necessarily does) have
  • Discrepancy - extent to which an observation is “unusual” or “different” from the rest of the data

Various statistical measures exist for quantifying leverage and discrepancy. Leverage is frequently defined by the leverage statistic (also known as the hat value):

\[h_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_{i'=1}^{n} (x_{i'} - \bar{x})^2}\]
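For a bivariate regression, this formula can be checked directly against R's built-in hatvalues() function; a quick sketch on simulated data:

```r
set.seed(1234)
x <- rnorm(20)
y <- 2 + 3 * x + rnorm(20)
mod <- lm(y ~ x)

# hat values computed from the formula above
h_manual <- 1 / length(x) + (x - mean(x))^2 / sum((x - mean(x))^2)

all.equal(unname(hatvalues(mod)), h_manual)  # TRUE
```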

Residuals are commonly used to measure discrepancy (either raw, standardized, or studentized). Cook’s distance (or Cook’s D) combines these two measures to calculate an observation’s influence:

\[D_i = \frac{\tilde{u}_i^2}{K} \times \frac{h_i}{1 - h_i}\]

where \(\tilde{u}_i^2\) is the squared standardized residual, \(K\) is the number of parameters in the model, and \(h_i\) is the hat value.
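Similarly, Cook's D can be reconstructed from rstandard() and hatvalues() and checked against R's cooks.distance(); a sketch on simulated data:

```r
set.seed(1234)
x <- rnorm(20)
y <- 2 + 3 * x + rnorm(20)
mod <- lm(y ~ x)

K <- length(coef(mod))  # number of parameters, including the intercept
d_manual <- rstandard(mod)^2 / K * hatvalues(mod) / (1 - hatvalues(mod))

all.equal(unname(cooks.distance(mod)), unname(d_manual))  # TRUE
```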

Where is the visualization? By combining all three quantities into a “bubble plot”, we can visualize leverage, discrepancy, and influence simultaneously. For example, here are the results of a basic model of the number of federal laws struck down by the U.S. Supreme Court in each Congress, based on:

  1. Age - the mean age of the members of the Supreme Court
  2. Tenure - mean tenure of the members of the Court
  3. Unified - a dummy variable indicating whether or not Congress and the presidency were controlled by the same party in that period
library(haven)

dahl <- read_dta("data/LittleDahl.dta")
dahl_mod <- lm(nulls ~ age + tenure + unified, data = dahl)

dahl_augment <- dahl %>%
  mutate(hat = hatvalues(dahl_mod),
         student = rstudent(dahl_mod),
         cooksd = cooks.distance(dahl_mod))

# use size
ggplot(dahl_augment, aes(hat, student)) +
  geom_hline(yintercept = 0, linetype = 2) +
  geom_point(aes(size = cooksd), shape = 1) +
  geom_text(data = dahl_augment %>%
              arrange(-cooksd) %>%
              slice(1:10),
            aes(label = Congress)) +
  scale_size_continuous(range = c(1, 20)) +
  labs(x = "Leverage",
       y = "Studentized residual") +
  theme(legend.position = "none")

# use color and geom_text_repel
ggplot(dahl_augment, aes(hat, student, color = cooksd)) +
  geom_hline(yintercept = 0, linetype = 2) +
  geom_point() +
  ggrepel::geom_text_repel(data = dahl_augment %>%
              arrange(-cooksd) %>%
              slice(1:10),
            aes(label = Congress)) +
  labs(x = "Leverage",
       y = "Studentized residual") +
  theme(legend.position = "none")

The bubble plot tells us several things:

  • The size/color of the symbols is proportional to Cook’s D, which is in turn a multiplicative function of the square of the studentized residuals (the y-axis) and the leverage (the x-axis), so observations farther from \(y = 0\) and/or with higher leverage have larger symbols.
  • The plot tells us whether the large influence of an observation is due to high discrepancy, high leverage, or both
    • The 104th Congress has relatively low leverage but is very discrepant
    • The 74th and 98th Congresses demonstrate both high discrepancy and high leverage

Publishing statistical visualizations

Descriptive statistics

Mosaic plot

What is the relationship between happiness and gender? We could examine this relationship in several different contingency tables, depending on the probability distribution on which we wish to focus:

# Mosaic plot of happiness and sex
library(productplots)
data("happy")

happy <- happy %>%
  na.omit

Joint distribution

  • \(f(\text{happy}, \text{sex})\)
# contingency table
library(descr)
crosstab(happy$happy, happy$sex, prop.t = TRUE, plot = FALSE)
##    Cell Contents 
## |-------------------------|
## |                   Count | 
## |           Total Percent | 
## |-------------------------|
## 
## =======================================
##                  happy$sex
## happy$happy       male   female   Total
## ---------------------------------------
## not too happy    1904     2460    4364 
##                   5.5%     7.1%        
## ---------------------------------------
## pretty happy     8760    10539   19299 
##                  25.2%    30.3%        
## ---------------------------------------
## very happy       4833     6327   11160 
##                  13.9%    18.2%        
## ---------------------------------------
## Total           15497    19326   34823 
## =======================================

Conditional distributions of sex and happiness

  • \(f(\text{sex} | \text{happy})\)
  • \(f(\text{happy} | \text{sex})\)
crosstab(happy$happy, happy$sex, prop.r = TRUE, plot = FALSE)
##    Cell Contents 
## |-------------------------|
## |                   Count | 
## |             Row Percent | 
## |-------------------------|
## 
## =======================================
##                  happy$sex
## happy$happy       male   female   Total
## ---------------------------------------
## not too happy    1904     2460    4364 
##                  43.6%    56.4%   12.5%
## ---------------------------------------
## pretty happy     8760    10539   19299 
##                  45.4%    54.6%   55.4%
## ---------------------------------------
## very happy       4833     6327   11160 
##                  43.3%    56.7%   32.0%
## ---------------------------------------
## Total           15497    19326   34823 
## =======================================
crosstab(happy$happy, happy$sex, prop.c = TRUE, plot = FALSE)
##    Cell Contents 
## |-------------------------|
## |                   Count | 
## |          Column Percent | 
## |-------------------------|
## 
## =======================================
##                  happy$sex
## happy$happy       male   female   Total
## ---------------------------------------
## not too happy    1904     2460    4364 
##                  12.3%    12.7%        
## ---------------------------------------
## pretty happy     8760    10539   19299 
##                  56.5%    54.5%        
## ---------------------------------------
## very happy       4833     6327   11160 
##                  31.2%    32.7%        
## ---------------------------------------
## Total           15497    19326   34823 
##                  44.5%    55.5%        
## =======================================

Conditional and marginal distributions

  • \(f(\text{sex} | \text{happy})\) and \(f(\text{happy} | \text{sex})\), plus the marginals \(f(\text{happy})\) and \(f(\text{sex})\)
crosstab(happy$happy, happy$sex, prop.c = TRUE, prop.r = TRUE, plot = FALSE)
##    Cell Contents 
## |-------------------------|
## |                   Count | 
## |             Row Percent | 
## |          Column Percent | 
## |-------------------------|
## 
## =======================================
##                  happy$sex
## happy$happy       male   female   Total
## ---------------------------------------
## not too happy    1904     2460    4364 
##                  43.6%    56.4%   12.5%
##                  12.3%    12.7%        
## ---------------------------------------
## pretty happy     8760    10539   19299 
##                  45.4%    54.6%   55.4%
##                  56.5%    54.5%        
## ---------------------------------------
## very happy       4833     6327   11160 
##                  43.3%    56.7%   32.0%
##                  31.2%    32.7%        
## ---------------------------------------
## Total           15497    19326   34823 
##                  44.5%    55.5%        
## =======================================

Each of the contingency tables encourages a different type of comparison, so the author has to decide in advance which comparison is most important and include the appropriate table. Alternatively, we can visualize this information using a mosaic plot, in which the area of each rectangle is proportional to the number of observations falling into the respective contingencies.

There are a few different packages available for drawing mosaic plots in R.

graphics::mosaicplot()

mosaicplot(~ sex + happy, data = happy)

vcd::mosaic()

library(vcd)
mosaic(~ happy + sex, happy)

productplots::prodplot()

# mosaic plot using productplots
prodplot(happy, ~ happy + sex)

# add color
prodplot(happy, ~ happy + sex) +
  aes(fill = happy) +
  theme(panel.grid = element_blank())

prodplot(happy, ~ happy + marital) +
  aes(fill = happy) +
  theme(legend.position = "none") +
  theme(panel.grid = element_blank())

Notice that the mosaic plot is very similar to a proportional bar chart:

ggplot(happy, aes(marital, fill = happy)) +
  geom_bar(position = "fill")

However, in the proportional bar chart the bar widths are constant, and therefore we do not know what proportion of individuals in the survey are married vs. never married, or make any other similar comparison.

Dot plot for summary statistics

library(ISLR)

OJ_sum <- OJ %>%
  select(ends_with("MM"), ends_with("CH")) %>%
  gather(var, value) %>%
  group_by(var) %>%
  summarize(mean = mean(value),
            sd = sd(value),
            min = min(value),
            max = max(value),
            n = n())

# print the table
kable(OJ_sum)
var           mean    sd   min   max    n
DiscCH       0.052 0.117  0.00 0.500 1070
DiscMM       0.123 0.214  0.00 0.800 1070
LoyalCH      0.566 0.308  0.00 1.000 1070
PctDiscCH    0.027 0.062  0.00 0.253 1070
PctDiscMM    0.059 0.102  0.00 0.402 1070
PriceCH      1.867 0.102  1.69 2.090 1070
PriceMM      2.085 0.134  1.69 2.290 1070
SalePriceCH  1.816 0.143  1.39 2.090 1070
SalePriceMM  1.962 0.253  1.19 2.290 1070
SpecialCH    0.148 0.355  0.00 1.000 1070
SpecialMM    0.162 0.368  0.00 1.000 1070
# plot using a single dot plot
ggplot(OJ_sum, aes(x = fct_reorder(var, mean), y = mean)) +
  geom_linerange(aes(ymin = mean - 2 * sd,
                      ymax = mean + 2 * sd),
                  linetype = 2,
                 size = .25) +
  geom_linerange(aes(ymin = mean - sd,
                      ymax = mean + sd),
                  size = 1) +
  geom_point() +
  coord_flip() +
  labs(x = NULL,
       y = NULL)

# dodge based on OJ brand
OJ_sum %>%
  separate(var, into = c("var", "brand"), -3, remove = TRUE) %>%
  ggplot(aes(x = fct_reorder(var, mean), y = mean, color = brand)) +
  geom_linerange(aes(ymin = mean - 2 * sd,
                      ymax = mean + 2 * sd),
                  linetype = 2,
                 size = .25,
                 position = position_dodge(width = 0.5)) +
  geom_linerange(aes(ymin = mean - sd,
                      ymax = mean + sd),
                  size = 1,
                 position = position_dodge(width = 0.5)) +
  geom_point(position = position_dodge(width = 0.5)) +
  coord_flip() +
  labs(x = NULL,
       y = NULL,
       color = "Brand")

# facet based on OJ brand
OJ_sum %>%
  separate(var, into = c("var", "brand"), -3, remove = TRUE) %>%
  ggplot(aes(x = fct_reorder(var, mean), y = mean)) +
  facet_grid(. ~ brand) +
  geom_linerange(aes(ymin = mean - 2 * sd,
                      ymax = mean + 2 * sd),
                  linetype = 2,
                 size = .25) +
  geom_linerange(aes(ymin = mean - sd,
                      ymax = mean + sd),
                  size = 1) +
  geom_point() +
  coord_flip() +
  labs(x = NULL,
       y = NULL,
       color = "Brand")

Regression results

Single linear model

library(coefplot)

model <- lm(hwy ~ displ + I(displ^2), data = mpg)

# in a table
tidy(model) %>%
  kable
term         estimate  std.error  statistic  p.value
(Intercept)     49.24      1.858      26.51        0
displ          -11.76      1.073     -10.96        0
I(displ^2)       1.09      0.141       7.77        0
# in a plot
coefplot(model)

Multiple linear models

displ <- lm(hwy ~ displ + I(displ^2), data = mpg)
displ_cyl <- lm(hwy ~ displ + I(displ^2) + cyl, data = mpg)
displ_cyl_drv <- lm(hwy ~ displ + I(displ^2) + cyl + drv, data = mpg)
displ_cyl_drv_class <- lm(hwy ~ displ + I(displ^2) + cyl + drv + class, data = mpg)

# in a table
library(stargazer)
stargazer(displ, displ_cyl, displ_cyl_drv, displ_cyl_drv_class,
          type = "html",
          title = "Regression models on highway fuel efficiency",
          covariate.labels = c("Engine displacement", "$\\text{Engine displacement}\\^2$",
                               "Number of cylinders", "Front wheel drive",
                               "Rear wheel drive", "Compact", "Midsize",
                               "Minivan", "Pickup", "Subcompact", "SUV",
                               "Constant"),
          dep.var.caption = "",
          dep.var.labels.include = FALSE,
          align =TRUE)
Regression models on highway fuel efficiency
=========================================================================
                            (1)          (2)          (3)          (4)
-------------------------------------------------------------------------
Engine displacement     -11.800***   -11.200***    -7.150***    -3.710***
                         (1.070)      (1.410)      (1.220)      (1.280)
Engine displacement^2     1.090***     1.060***     0.686***     0.381***
                         (0.141)      (0.153)      (0.130)      (0.135)
Number of cylinders                   -0.264       -0.746**     -1.040***
                                      (0.411)      (0.343)      (0.315)
Front wheel drive                                   4.520***     2.770***
                                                   (0.496)      (0.619)
Rear wheel drive                                    4.180***     1.170
                                                   (0.686)      (0.734)
Compact                                                         -2.780*
                                                                (1.510)
Midsize                                                         -2.630*
                                                                (1.550)
Minivan                                                         -6.500***
                                                                (1.720)
Pickup                                                          -7.380***
                                                                (1.470)
Subcompact                                                      -2.230
                                                                (1.470)
SUV                                                             -6.560***
                                                                (1.370)
Constant                 49.200***    49.300***    40.800***    40.300***
                         (1.860)      (1.860)      (1.750)      (2.050)
-------------------------------------------------------------------------
Observations               234          234          234          234
R2                        0.672        0.673        0.782        0.838
Adjusted R2               0.670        0.669        0.778        0.830
Residual Std. Error    3.420 (df = 231)  3.430 (df = 230)  2.810 (df = 228)  2.450 (df = 222)
F Statistic            237.000*** (df = 2; 231)  158.000*** (df = 3; 230)  164.000*** (df = 5; 228)  105.000*** (df = 11; 222)
-------------------------------------------------------------------------
Note: *p<0.1; **p<0.05; ***p<0.01
# in a plot
multiplot(displ, displ_cyl, displ_cyl_drv, displ_cyl_drv_class,
          title = "Regression models on highway fuel efficiency",
          xlab = NULL,
          ylab = NULL,
          newNames = c("displ" = "Engine displacement",
                       "I(displ^2)" = "Engine displacement^2",
                       "cyl" = "Number of cylinders",
                       "drvf" = "Front wheel drive",
                       "drvr" = "Rear wheel drive",
                       "classcompact" = "Compact",
                       "classmidsize" = "Midsize",
                       "classminivan"   = "Minivan",
                       "classpickup" = "Pickup",
                       "classsubcompact" = "Subcompact",
                       "classsuv" = "SUV")) +
  scale_color_discrete(labels = c("Model 1", "Model 2", "Model 3", "Model 4"))

multiplot(displ, displ_cyl, displ_cyl_drv, displ_cyl_drv_class,
          single = FALSE,
          names = c("Model 1", "Model 2", "Model 3", "Model 4"),
          title = "Regression models on highway fuel efficiency",
          xlab = NULL,
          ylab = NULL,
          newNames = c("displ" = "Engine displacement",
                       "I(displ^2)" = "Engine displacement^2",
                       "cyl" = "Number of cylinders",
                       "drvf" = "Front wheel drive",
                       "drvr" = "Rear wheel drive",
                       "classcompact" = "Compact",
                       "classmidsize" = "Midsize",
                       "classminivan"   = "Minivan",
                       "classpickup" = "Pickup",
                       "classsubcompact" = "Subcompact",
                       "classsuv" = "SUV")) +
  theme(legend.position = "none")

Generalized linear models

Visualizations are great tools for presenting and interpreting results from other regression-based models that do not use continuous dependent variables. For instance, consider the Titanic:

library(titanic)
titanic <- titanic_train %>%
  as_tibble()

titanic %>%
  head() %>%
  knitr::kable()
PassengerId Survived Pclass Name                                                 Sex    Age SibSp Parch Ticket            Fare Cabin Embarked
          1        0      3 Braund, Mr. Owen Harris                              male    22     1     0 A/5 21171         7.25       S
          2        1      1 Cumings, Mrs. John Bradley (Florence Briggs Thayer)  female  38     1     0 PC 17599         71.28 C85   C
          3        1      3 Heikkinen, Miss. Laina                               female  26     0     0 STON/O2. 3101282  7.92       S
          4        1      1 Futrelle, Mrs. Jacques Heath (Lily May Peel)         female  35     1     0 113803           53.10 C123  S
          5        0      3 Allen, Mr. William Henry                             male    35     0     0 373450            8.05       S
          6        0      3 Moran, Mr. James                                     male    NA     0     0 330877            8.46       Q

If we estimate a logistic regression model (or models) of survival on the Titanic, the resulting parameters are very difficult to interpret because they are expressed in terms of log-odds, not something intuitive such as odds or probabilities. For example, a model based on age looks like the following:

survive_age <- glm(Survived ~ Age, data = titanic, family = binomial)
tidy(survive_age)
##          term estimate std.error statistic p.value
## 1 (Intercept)  -0.0567   0.17358    -0.327  0.7438
## 2         Age  -0.0110   0.00533    -2.057  0.0397
coefplot(survive_age)
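The rcfss package used below is course-specific; if it is unavailable, equivalent helpers (the names logit2prob() and prob2odds() are assumed to match rcfss's) are just the logistic transformation and the odds formula:

```r
# log-odds -> probability (the inverse logit; equivalently plogis())
logit2prob <- function(x) exp(x) / (1 + exp(x))

# probability -> odds
prob2odds <- function(p) p / (1 - p)

logit2prob(0)   # a log-odds of 0 is a probability of 0.5
prob2odds(0.5)  # a probability of 0.5 is even odds of 1
```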

# generate predicted values
library(rcfss) # need to convert log-odds to probabilities
titanic_age <- titanic %>%
  data_grid(Age) %>%
  add_predictions(survive_age) %>%
  # predicted values are in the log-odds form - convert to probabilities
  mutate(prob = logit2prob(pred))

p_age <- ggplot(titanic_age, aes(Age, prob)) +
  geom_line() +
  scale_y_continuous(limits = c(0, 1)) +
  labs(title = "Probability of surviving the Titanic",
       subtitle = "Age",
       y = "Predicted probability of survival")
p_age

By visualizing the results as predicted probabilities (which are curvilinear and do not lend themselves well to presentation in a table), we gain a more intuitive understanding of the results: here, as age increases, the probability of survival decreases. Alternatively, in odds form:

survive_age_pred <- titanic_age %>%
  mutate(odds = prob2odds(prob))

ggplot(survive_age_pred, aes(Age, odds)) +
  geom_line(color = "blue", size = 1) +
  labs(x = "Age",
       y = "Odds of surviving the Titanic")

Regardless of age, the odds of surviving the Titanic are always below 1. This is expected: a predicted probability below \(.50\) corresponds to odds below 1, and even a 1 year old’s predicted probability of survival is less than \(.50\). The relationship between age and the odds of survival is still curvilinear.

This is especially true once we introduce additional variables into the model:

survive_age_woman <- glm(Survived ~ Age + Sex, data = titanic,
                         family = binomial)

titanic_age_sex <- titanic %>%
  data_grid(Age, Sex) %>%
  add_predictions(survive_age_woman) %>%
  mutate(pred = logit2prob(pred))

p_age_sex <- ggplot(titanic_age_sex, aes(Age, pred, color = Sex)) +
  geom_line() +
  scale_y_continuous(limits = c(0, 1)) +
  labs(title = "Probability of surviving the Titanic",
       subtitle = "Age + Sex",
       y = "Predicted probability of survival",
       color = "Sex")
p_age_sex

survive_age_woman_x <- glm(Survived ~ Age * Sex, data = titanic,
                           family = binomial)

titanic_age_sex_x <- titanic %>%
  data_grid(Age, Sex) %>%
  add_predictions(survive_age_woman_x) %>%
  mutate(pred = logit2prob(pred))

p_age_sex_x <- ggplot(titanic_age_sex_x, aes(Age, pred, color = Sex)) +
  geom_line() +
  scale_y_continuous(limits = c(0, 1)) +
  labs(title = "Probability of surviving the Titanic",
       subtitle = "Age x Sex",
       y = "Predicted probability of survival",
       color = "Sex")
p_age_sex_x

If we want to present the results simultaneously, we have a few options:

# in a table
stargazer(survive_age, survive_age_woman, survive_age_woman_x,
          type = "html",
          title = "Probability of surviving the Titanic",
          covariate.labels = c("Age", "Male", "Age x Male", "Constant"),
          dep.var.caption = "",
          dep.var.labels.include = FALSE,
          align = TRUE)
Probability of surviving the Titanic
=================================================
                      (1)        (2)        (3)
-------------------------------------------------
Age                 -0.011**   -0.005      0.020*
                    (0.005)    (0.006)    (0.011)
Male                           -2.470***  -1.320***
                               (0.185)    (0.408)
Age x Male                                -0.041***
                                          (0.014)
Constant            -0.057      1.280***   0.594*
                    (0.174)    (0.230)    (0.310)
-------------------------------------------------
Observations          714        714        714
Log Likelihood     -480.000   -375.000   -370.000
Akaike Inf. Crit.   964.000    756.000    748.000
-------------------------------------------------
Note: *p<0.1; **p<0.05; ***p<0.01
# in a coefficient plot
multiplot(survive_age, survive_age_woman, survive_age_woman_x,
          title = "Probability of surviving the Titanic",
          xlab = "Parameter (log-odds)",
          ylab = NULL,
          newNames = c("Sexmale" = "Male",
                       "Age:Sexmale" = "Age x Male")) +
  scale_color_discrete(labels = c("Age", "Age + Sex", "Age x Sex"))

# in a predicted probability plot
bind_rows(list("Age" = titanic_age %>%
                 mutate(pred = prob),
               "Age + Sex" = titanic_age_sex,
               "Age x Sex" = titanic_age_sex_x), .id = "id") %>%
  mutate(Sex = ifelse(is.na(Sex), "both", Sex)) %>%
  # plot all three models
  ggplot(aes(Age, pred, color = Sex, linetype = id)) +
  geom_line() +
  scale_y_continuous(limits = c(0, 1)) +
  scale_color_manual(values = c("#1b9e77", "#d95f02", "#7570b3")) +
  labs(title = "Probability of surviving the Titanic",
       x = "Age",
       y = "Predicted probability of survival",
       color = "Sex",
       linetype = "Model")
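The interaction term is what lets the age slope differ by sex in the third model. On the log-odds scale, the slope for women is just the Age coefficient, while the slope for men adds the Age x Male term. A quick check using the rounded estimates transcribed from column (3) of the table (so the values are approximate):

```r
# rounded estimates from the Age x Sex model (column 3 of the table)
b_age      <-  0.020   # Age (the slope for women)
b_age_male <- -0.041   # Age x Male interaction

slope_women <- b_age               #  0.020: log-odds of survival rise with age
slope_men   <- b_age + b_age_male  # -0.021: and fall with age for men
c(women = slope_women, men = slope_men)
```

This sign flip is exactly what the crossing lines in the predicted probability plot show.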

Session Info

devtools::session_info()
##  setting  value                       
##  version  R version 3.3.3 (2017-03-06)
##  system   x86_64, darwin13.4.0        
##  ui       X11                         
##  language (EN)                        
##  collate  en_US.UTF-8                 
##  tz       America/Chicago             
##  date     2017-04-26                  
## 
##  package      * version    date       source
##  assertthat     0.1        2013-12-06 CRAN (R 3.3.0)
##  backports      1.0.5      2017-01-18 CRAN (R 3.3.2)
##  broom        * 0.4.2      2017-02-13 CRAN (R 3.3.2)
##  codetools      0.2-15     2016-10-05 CRAN (R 3.3.3)
##  coefplot     * 1.2.4.9000 2017-04-25 Github (jaredlander/coefplot@0755a00)
##  colorspace     1.3-2      2016-12-14 CRAN (R 3.3.2)
##  DBI            0.6        2017-03-09 CRAN (R 3.3.3)
##  descr        * 1.1.3      2016-05-11 CRAN (R 3.3.0)
##  devtools       1.12.0     2016-06-24 CRAN (R 3.3.0)
##  digest         0.6.12     2017-01-27 CRAN (R 3.3.2)
##  dplyr        * 0.5.0      2016-06-24 CRAN (R 3.3.0)
##  evaluate       0.10       2016-10-11 CRAN (R 3.3.0)
##  forcats      * 0.2.0      2017-01-23 CRAN (R 3.3.2)
##  foreign        0.8-67     2016-09-13 CRAN (R 3.3.3)
##  ggplot2      * 2.2.1      2016-12-30 CRAN (R 3.3.2)
##  gtable         0.2.0      2016-02-26 CRAN (R 3.3.0)
##  haven        * 1.0.0      2016-09-23 cran (@1.0.0)
##  hms            0.3        2016-11-22 CRAN (R 3.3.2)
##  htmltools      0.3.5      2016-03-21 CRAN (R 3.3.0)
##  httr           1.2.1      2016-07-03 CRAN (R 3.3.0)
##  ISLR         * 1.0        2013-06-11 CRAN (R 3.3.0)
##  jsonlite       1.3        2017-02-28 CRAN (R 3.3.2)
##  knitr        * 1.15.1     2016-11-22 cran (@1.15.1)
##  labeling       0.3        2014-08-23 CRAN (R 3.3.0)
##  lattice        0.20-34    2016-09-06 CRAN (R 3.3.3)
##  lazyeval       0.2.0      2016-06-12 CRAN (R 3.3.0)
##  lmtest         0.9-35     2017-02-11 CRAN (R 3.3.2)
##  lubridate      1.6.0      2016-09-13 CRAN (R 3.3.0)
##  magrittr       1.5        2014-11-22 CRAN (R 3.3.0)
##  MASS           7.3-45     2016-04-21 CRAN (R 3.3.0)
##  Matrix         1.2-8      2017-01-20 CRAN (R 3.3.3)
##  MatrixModels * 0.4-1      2015-08-22 CRAN (R 3.3.0)
##  memoise        1.0.0      2016-01-29 CRAN (R 3.3.0)
##  mnormt         1.5-5      2016-10-15 CRAN (R 3.3.0)
##  modelr       * 0.1.0      2016-08-31 CRAN (R 3.3.0)
##  munsell        0.4.3      2016-02-13 CRAN (R 3.3.0)
##  nlme           3.1-131    2017-02-06 CRAN (R 3.3.3)
##  plyr           1.8.4      2016-06-08 CRAN (R 3.3.0)
##  productplots * 0.1.1.9000 2017-04-25 Github (hadley/productplots@391f500)
##  psych          1.7.3.21   2017-03-22 CRAN (R 3.3.2)
##  purrr        * 0.2.2      2016-06-18 CRAN (R 3.3.0)
##  quantreg     * 5.29       2016-09-04 CRAN (R 3.3.0)
##  R6             2.2.0      2016-10-05 CRAN (R 3.3.0)
##  rcfss        * 0.1.4      2017-02-28 local
##  Rcpp           0.12.10    2017-03-19 cran (@0.12.10)
##  readr        * 1.1.0      2017-03-22 cran (@1.1.0)
##  readxl         0.1.1      2016-03-28 CRAN (R 3.3.0)
##  reshape2       1.4.2      2016-10-22 CRAN (R 3.3.0)
##  rmarkdown      1.3        2016-12-21 CRAN (R 3.3.2)
##  rprojroot      1.2        2017-01-16 CRAN (R 3.3.2)
##  rvest          0.3.2      2016-06-17 CRAN (R 3.3.0)
##  scales         0.4.1      2016-11-09 CRAN (R 3.3.1)
##  SparseM      * 1.74       2016-11-10 CRAN (R 3.3.2)
##  stargazer    * 5.2        2015-07-14 CRAN (R 3.3.0)
##  stringi        1.1.2      2016-10-01 CRAN (R 3.3.0)
##  stringr      * 1.2.0      2017-02-18 CRAN (R 3.3.2)
##  tibble       * 1.2        2016-08-26 cran (@1.2)
##  tidyr        * 0.6.1      2017-01-10 CRAN (R 3.3.2)
##  tidyverse    * 1.1.1      2017-01-27 CRAN (R 3.3.2)
##  titanic      * 0.1.0      2015-08-31 CRAN (R 3.3.0)
##  useful         1.2.1      2016-06-29 CRAN (R 3.3.0)
##  vcd          * 1.4-3      2016-09-17 CRAN (R 3.3.0)
##  withr          1.0.2      2016-06-20 CRAN (R 3.3.0)
##  xml2           1.1.1      2017-01-24 CRAN (R 3.3.2)
##  xtable         1.8-2      2016-02-05 CRAN (R 3.3.0)
##  yaml           2.1.14     2016-11-12 cran (@2.1.14)
##  zoo            1.7-14     2016-12-16 CRAN (R 3.3.2)